Reetesh Virous
Fri Nov 11 2022

A great infrastructure to continuously refine the dataset based on the latest model

Now… this is the non-obvious one, and the one I've been most passionate about. I call this the data loop. Andrej/Tesla has called this their data engine. We've also called this our data factory. Whatever you call it, this is the single most important thing to get right to enable a proper SW 2.0 environment. Why? Again: data is the source code of AI.

This infrastructure needs to enable developers to build their datasets like they would their source code in traditional SW 1.0 programs. Nurture them, cherish them, debug them, continuously find issues, refine them, polish them… love them :). If you know how a SW developer cherishes their source code, how they care about formatting, elegance, simplicity… then you should expect a SW 2.0 dev to do exactly the same with their dataset. If they don't, and datasets are an afterthought, then you have a big red flag: your org, program, and culture are probably not set up for success. Again: data is the source code of AI. At this point some companies have completely internalized this, and are flying years ahead of the rest. So let's look at this closer.
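To make this concrete before digging in, here is a minimal sketch of what a data loop amounts to in code. Every function and name here is an illustrative stand-in, not a real framework: train on the current dataset, use the resulting model to surface suspect data, patch the dataset, and repeat.

```python
# Minimal sketch of a "data loop": train on the current dataset, use the
# resulting model to find faulty data, patch the dataset, and repeat.
# All components here are illustrative stand-ins, not a real framework.

def data_loop(dataset, train, find_issues, patch, rounds=3):
    """Iteratively refine `dataset` using the latest trained model."""
    model = None
    for _ in range(rounds):
        model = train(dataset)                # "compile" the data into a model
        issues = find_issues(model, dataset)  # e.g. suspected label errors
        if not issues:
            break                             # clean enough for this cycle
        dataset = patch(dataset, issues)      # re-label, remove, or add data
    return model, dataset

# Toy instantiation: a 1-D threshold "model" and one mislabeled sample.
data = [(-2, 0), (-1, 0), (1, 1), (2, 0)]     # (2, 0) is a label error

def train(ds):                                # "train": predict 1 iff x > 0
    return lambda x: int(x > 0)

def find_issues(model, ds):                   # flag model/label disagreements
    return [i for i, (x, y) in enumerate(ds) if model(x) != y]

def patch(ds, issues):                        # toy fix: re-label flagged samples
    return [(x, train(ds)(x)) if i in issues else (x, y)
            for i, (x, y) in enumerate(ds)]

model, clean = data_loop(data, train, find_issues, patch)
# clean -> [(-2, 0), (-1, 0), (1, 1), (2, 1)]
```

The point of the sketch is the shape of the loop, not the toy components: in practice each of `train`, `find_issues` and `patch` is a substantial piece of infrastructure.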
How do you get there? Two important points to internalize:

1. Unlike source code, datasets cannot be written out of thin air. They need to be collected or synthetically generated. This is the most brutal difference with SW 1.0, and the very first thing you need to completely internalize, and not shy away from.

2. Like source code, datasets need to be versioned, compiled (a DL model trained on them, producing some results), analyzed, optimized, and refined over time. When an issue gets found, the dataset needs to be patched: the faulty data needs to be removed or corrected, and the missing data needs to be plugged (new relevant data collected and/or generated).

Read these 2 points again. They are loaded. They imply that you will need to build a considerable amount of new infrastructure to enable this effectively. Weird infrastructure. SW 2.0 infrastructure. Unlike training infrastructure, there's really not much available out of the box. Companies who get it are building great infra internally to achieve this, for their own vertical problems. We're going to need to see much more done in open source or via startups to plug this hole. So let's process these 2 points, and turn them into the actionable ingredients you need:

It should be easy to acquire/collect data from the target distribution. Say you're building a robot that will actuate on factory floors, and you have a given sensor set of a few cameras and some ultrasonics: you should be able to rapidly query and obtain full sensor recordings from a fleet of such robots, in their target environment. Some companies have unfair advantages here, as they own consumer platforms/products that already sit in that target environment. If you're not there yet, you need to find proxies or get there as fast as possible.

It should be easy to synthetically generate data from the target distribution. If you're limited in your ability to collect real data, then this is a must, and most likely the way you can make progress until you solve your problem of acquiring real data at scale. This is an incredibly hard problem on its own: generating synthetic data from the target distribution not only implies having a good simulator for that data (a model of the world, a model of the sensors) but also a good model of the target distribution itself… which often ends up requiring access to real data anyway. I think of this as complementary to real data.

It should be easy to produce ground truth for your data. The data you acquired above is the input part of the {input, output} pairs; the ground truth is the output, what you actually want to predict. The most basic way to approach this is to throw human labelers at the problem, and today you have great companies/offerings like Scale.ai providing such services. But this is where you need to get way more creative.
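One common way to get more creative, as a sketch (the model, the confidence scores, and the 0.9 threshold are all hypothetical): pre-label with a larger model, auto-accept its confident predictions, and only route uncertain samples to human labelers.

```python
# Sketch of model-assisted pre-labeling: a large model labels everything,
# and only low-confidence samples are queued for human labeling.
# `big_model` and the threshold are hypothetical stand-ins.

def pre_label(samples, big_model, threshold=0.9):
    auto_labeled, needs_human = [], []
    for sample in samples:
        label, confidence = big_model(sample)
        if confidence >= threshold:
            auto_labeled.append((sample, label))   # trust the model
        else:
            needs_human.append(sample)             # queue for manual labeling
    return auto_labeled, needs_human

# Toy "big model": confident far from zero, unsure near zero.
def big_model(x):
    return (int(x > 0), min(1.0, abs(x) / 2))

auto, manual = pre_label([-3, -0.5, 0.4, 2.5], big_model)
# auto   -> [(-3, 0), (2.5, 1)]
# manual -> [-0.5, 0.4]
```

Even this crude routing can cut manual labeling to a fraction of the dataset, which is exactly the kind of leverage you want from this step.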
This process needs to be automated as much as possible, leveraging models to pre-label. In the case of AV, there's so much you can do using larger models, or other sources of data like HD maps, to pre-label and minimize the amount of manual labeling. Each domain has its own tricks to be found to semi-automate this step as much as possible.

It should be easy to analyze datasets for errors or gaps. You'll rapidly find that a core skill your org needs is finding errors in your datasets (the equivalent of bugs in source code) and fixing them (re-label, or remove), as well as finding gaps/holes, e.g. parts of the target distribution that are under-represented. This requires a blend of methods (e.g. using multiple trained models and looking for divergence of opinions, or active learning techniques), as well as infrastructure that makes it easy. My team published several such techniques [here, here, and there]. This space is large and still under-explored, possibly because it's domain-specific, company-infra-specific, and not easy for 3rd parties/academics to explore.

It should be easy to mine or curate data. Finally, and in support of the above, you need a way to mine data either directly at the source/edge (in the case of AV, in the fleet), or cloud-side (say you've over-collected, and want to target within that larger set). Mining cloud-side is an easier way to get started, and in a way many web companies do that as well: log everything to start with, build a large data lake, and then have the machinery to query it later (Spark, Presto, etc.). But ultimately the ability to query at the edge/source is key, and I believe all AI applications will end up built that way, for scalability reasons and for privacy reasons.

There's more than that, but these are the pillars I see that have to be done right for anything else to work well.

What's next

So is the problem solved? Are we as a community building everything that's required to enable SW 2.0 to be developed effectively, by anyone, across industries? Certainly not. I believe we have found the right pillars, the right way to frame SW 2.0 development, the right vocabulary (MLOps as an analogy to SW 1.0 DevOps), and some companies are way ahead in enabling this type of SW 2.0 development loop, process and infrastructure. These need to mature, and then be opened up so more can benefit.
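As one small illustration of the kind of tooling that could be opened up, here is a sketch of the divergence-of-opinions technique mentioned earlier: train several models independently and flag the samples where the committee splits, or where it unanimously contradicts the label. The committee, data, and thresholds below are all illustrative.

```python
# Sketch of "divergence of opinions" error mining: run several independently
# trained models over the dataset and flag samples where the committee splits,
# or unanimously disagrees with the stored label (a likely label error).
# Models and data are illustrative stand-ins.

def disagreement_mining(models, dataset):
    flagged = []
    for i, (x, y) in enumerate(dataset):
        votes = [m(x) for m in models]
        if len(set(votes)) > 1 or votes[0] != y:
            flagged.append(i)   # committee split, or consensus contradicts label
    return flagged

# Toy committee: three threshold "models" with slightly different cutoffs.
models = [lambda x: int(x > 0), lambda x: int(x > 0.5), lambda x: int(x > -0.5)]
dataset = [(-2, 0), (0.2, 0), (3, 0), (3, 1)]

flagged = disagreement_mining(models, dataset)
# index 1: the models disagree with each other (a hard/ambiguous sample)
# index 2: the models agree with each other but not the label (likely mislabel)
```

Flagged samples then feed back into the patching step: re-label, remove, or collect more data like them.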
We will get there in the next 5–10y. As for the method itself, SW 2.0 is ripe, it works, and if you deploy all the right ingredients, you can solve amazingly complex problems (AV, robotics, chat bots, etc.) today. Now, having said that, I believe the next 5y will yield new ingredients getting us closer to AGI: agents capable of leveraging big models (transformers trained on large data), as well as memory, and the ability to offload, retain, and organize thoughts, upping the game one more notch. This will ultimately simplify the development of such applications further, and SW 2.0 will eat even more of the application logic. I'm excited about that, and I'm excited about seeing the current approach fully mature. It's an amazing time to be working on AI and its applications. To another great 5y!